Task

Your objective is to build a multiple linear regression model on the training data to predict the number of wins for the team. You can only use the variables given to you (or variables that you derive from the variables provided).

Regression Diagnostics for Multiple Regression

Chapter 6 in A Modern Approach to Regression explains that when fitting a multiple regression model, it is important to:

  1. Determine whether the proposed regression model is a valid model (i.e., determine whether it provides an adequate fit to the data). The main tools we will use to validate regression assumptions are plots involving standardized residuals and/or fitted values. We shall see that these plots enable us to assess visually whether the assumptions are being violated and, under certain conditions, point to what should be done to overcome these violations. We shall also consider a tool, called marginal model plots, which have wider application than residual plots.

  2. Determine which (if any) of the data points have predictor values that have an unusually large effect on the estimated regression model. (Recall that such points are called leverage points.)

  3. Determine which (if any) of the data points are outliers, that is, points which do not follow the pattern set by the bulk of the data, when one takes into account the given model.

  4. Assess the effect of each predictor variable on the response variable, having adjusted for the effect of other predictor variables using added variable plots.

  5. Assess the extent of collinearity among the predictor variables using variance inflation factors

Import Data and Inferential Statistics

# The "NULL" colClass drops the INDEX column; the + 1 shifts every value
# up by one so that zeros do not break later log/Box-Cox transformations.
moneyball_training_data <- read.csv("~/Documents/DATA 621/data_621_hw_1/data/moneyball-training-data.csv", colClasses = c("NULL", rep(NA, 16))) + 1

summary(moneyball_training_data)
##   TARGET_WINS     TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B 
##  Min.   :  1.00   Min.   : 892   Min.   : 70.0   Min.   :  1.00  
##  1st Qu.: 72.00   1st Qu.:1384   1st Qu.:209.0   1st Qu.: 35.00  
##  Median : 83.00   Median :1455   Median :239.0   Median : 48.00  
##  Mean   : 81.79   Mean   :1470   Mean   :242.2   Mean   : 56.25  
##  3rd Qu.: 93.00   3rd Qu.:1538   3rd Qu.:274.0   3rd Qu.: 73.00  
##  Max.   :147.00   Max.   :2555   Max.   :459.0   Max.   :224.00  
##                                                                  
##  TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO  TEAM_BASERUN_SB
##  Min.   :  1.0   Min.   :  1.0   Min.   :   1.0   Min.   :  1.0  
##  1st Qu.: 43.0   1st Qu.:452.0   1st Qu.: 549.0   1st Qu.: 67.0  
##  Median :103.0   Median :513.0   Median : 751.0   Median :102.0  
##  Mean   :100.6   Mean   :502.6   Mean   : 736.6   Mean   :125.8  
##  3rd Qu.:148.0   3rd Qu.:581.0   3rd Qu.: 931.0   3rd Qu.:157.0  
##  Max.   :265.0   Max.   :879.0   Max.   :1400.0   Max.   :698.0  
##                                  NA's   :102      NA's   :131    
##  TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H TEAM_PITCHING_HR
##  Min.   :  1.0   Min.   :30.00    Min.   : 1138   Min.   :  1.0   
##  1st Qu.: 39.0   1st Qu.:51.50    1st Qu.: 1420   1st Qu.: 51.0   
##  Median : 50.0   Median :59.00    Median : 1519   Median :108.0   
##  Mean   : 53.8   Mean   :60.36    Mean   : 1780   Mean   :106.7   
##  3rd Qu.: 63.0   3rd Qu.:68.00    3rd Qu.: 1684   3rd Qu.:151.0   
##  Max.   :202.0   Max.   :96.00    Max.   :30133   Max.   :344.0   
##  NA's   :772     NA's   :2085                                     
##  TEAM_PITCHING_BB TEAM_PITCHING_SO  TEAM_FIELDING_E  TEAM_FIELDING_DP
##  Min.   :   1.0   Min.   :    1.0   Min.   :  66.0   Min.   : 53.0   
##  1st Qu.: 477.0   1st Qu.:  616.0   1st Qu.: 128.0   1st Qu.:132.0   
##  Median : 537.5   Median :  814.5   Median : 160.0   Median :150.0   
##  Mean   : 554.0   Mean   :  818.7   Mean   : 247.5   Mean   :147.4   
##  3rd Qu.: 612.0   3rd Qu.:  969.0   3rd Qu.: 250.2   3rd Qu.:165.0   
##  Max.   :3646.0   Max.   :19279.0   Max.   :1899.0   Max.   :229.0   
##                   NA's   :102                        NA's   :286

The summary statistics reveal that several of the predictors contain NA values:

which(colSums(is.na(moneyball_training_data)) > 0)
##  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS TEAM_BATTING_HBP 
##                7                8                9               10 
## TEAM_PITCHING_SO TEAM_FIELDING_DP 
##               14               16

We can either drop these predictors entirely or drop the observations that contain any NAs.
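Either option is a one-liner in base R. A minimal sketch on a toy data frame (the same calls apply to moneyball_training_data; the 25% cutoff is an arbitrary choice for illustration):

```r
# Toy data frame standing in for moneyball_training_data
df <- data.frame(a = c(1, 2, NA, 4), b = c(NA, NA, 3, 4), c = 1:4)

# Proportion of NAs in each column
na_prop <- colMeans(is.na(df))

# Option 1: keep only predictors with few NAs (cutoff is illustrative)
df_drop_cols <- df[, na_prop <= 0.25, drop = FALSE]

# Option 2: drop observations containing any NA
df_drop_rows <- na.omit(df)
```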

The statistics also show that TEAM_FIELDING_E, TEAM_PITCHING_H, TEAM_PITCHING_SO, and TEAM_PITCHING_BB have maximum values that are extremely high relative to their medians and third quartiles.

Boxplots

for (predictor in colnames(moneyball_training_data)){
  print(ggplot(moneyball_training_data, aes(x = .data[[predictor]])) +
        geom_boxplot() +
        coord_flip() +
        xlab(predictor))
}

## Warning: Removed 102 rows containing non-finite values (stat_boxplot).

## Warning: Removed 131 rows containing non-finite values (stat_boxplot).

## Warning: Removed 772 rows containing non-finite values (stat_boxplot).

## Warning: Removed 2085 rows containing non-finite values (stat_boxplot).

## Warning: Removed 102 rows containing non-finite values (stat_boxplot).

## Warning: Removed 286 rows containing non-finite values (stat_boxplot).

Response vs. Predictor Plots

for (predictor in colnames(moneyball_training_data)){
  print(ggplot(moneyball_training_data, aes(x = .data[[predictor]], y = TARGET_WINS)) +
        geom_point() +
        xlab(predictor))
}

## Warning: Removed 102 rows containing missing values (geom_point).

## Warning: Removed 131 rows containing missing values (geom_point).

## Warning: Removed 772 rows containing missing values (geom_point).

## Warning: Removed 2085 rows containing missing values (geom_point).

## Warning: Removed 102 rows containing missing values (geom_point).

## Warning: Removed 286 rows containing missing values (geom_point).

Removal of Outliers

# Drop the extreme observations flagged by the boxplots above
moneyball_training_data <- filter(moneyball_training_data,
                                  TEAM_BATTING_H < 1783 &
                                  TEAM_PITCHING_H < 1783 &
                                  TEAM_PITCHING_SO < 1595 &
                                  TEAM_FIELDING_E < 886)
for (predictor in colnames(moneyball_training_data)){
  print(ggplot(moneyball_training_data, aes(x = .data[[predictor]], y = TARGET_WINS)) +
        geom_point() +
        xlab(predictor))
}

## Warning: Removed 2 rows containing missing values (geom_point).

## Warning: Removed 377 rows containing missing values (geom_point).

## Warning: Removed 1583 rows containing missing values (geom_point).

## Warning: Removed 74 rows containing missing values (geom_point).

Histograms

ggplot(data = melt(moneyball_training_data, "TARGET_WINS"), aes(value)) +
  geom_histogram() +
  facet_wrap(.~variable, scales = "free")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 2036 rows containing non-finite values (stat_bin).

Bimodal Predictor Variables

Create two models, one with the bimodal predictors and one without them. The bimodal variables appear to be TEAM_BATTING_HR, TEAM_BATTING_SO, and TEAM_PITCHING_HR. But let’s continue…
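A sketch of that two-model comparison, shown on a toy data frame so it stands alone (the same pattern applies to moneyball_training_data once its NAs are handled):

```r
# Toy stand-in for the training data
set.seed(42)
toy <- data.frame(TARGET_WINS     = rnorm(50, 81, 10),
                  TEAM_BATTING_HR = rnorm(50, 100, 30),
                  TEAM_BATTING_H  = rnorm(50, 1470, 100))

# Stand-in for the three bimodal variables identified above
bimodal <- c("TEAM_BATTING_HR")

# Model with and model without the bimodal predictors
lmod_with    <- lm(TARGET_WINS ~ ., data = toy)
lmod_without <- lm(TARGET_WINS ~ ., data = toy[, !(colnames(toy) %in% bimodal)])
```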

Variable Selection

In total there are 6 columns with missing values:

  • Strikeouts by batters (5% missing): values this common are highly unlikely to be truly absent; impute with the median or a regression model.

  • Stolen bases (6% missing): stolen bases weren’t tracked officially until 1887, so some of the missing data could be from 1871-1886. We could impute those values with the median or a regression model.

  • Caught stealing (34% missing): this statistic was not tracked until 1951. We might have to ignore this variable.

  • Batters hit by pitch (92% missing): replace with 0.

  • Strikeouts by pitchers (4% missing): highly unlikely to be truly absent; impute with the median or a regression model.

  • Double plays (12% missing): highly unlikely to be truly absent; impute with the median or a regression model.
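The median and zero rules above are simple to apply. A sketch on a toy frame with the same column names (the real data would be filled the same way):

```r
# Toy frame with the two NA patterns discussed above
df <- data.frame(TEAM_BATTING_SO  = c(600, NA, 800),
                 TEAM_BATTING_HBP = c(NA, 55, NA))

# Median imputation for strikeouts by batters
df$TEAM_BATTING_SO[is.na(df$TEAM_BATTING_SO)] <-
  median(df$TEAM_BATTING_SO, na.rm = TRUE)

# Batters hit by pitch: treat missing as zero
df$TEAM_BATTING_HBP[is.na(df$TEAM_BATTING_HBP)] <- 0
```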

Quantifying Correlation for Predictor Variables

We generate the correlation plot as shown below.

corrplot(cor(moneyball_training_data, use = "na.or.complete"), method = 'number', type = 'lower', diag = FALSE, number.cex = 0.5, tl.cex = 0.5)

Variables showing a correlation of 1 are highly correlated with one another. The variance inflation factors for these variables are also extremely high (see the “Linear Model with all Predictors” section).

Notes on Missing Values

This link points out that if less than 5% of the data is missing, then it is okay to ignore it.

This link points out that if more than 5% of the data is missing, then we might have to leave the variable out. But I don’t think that is necessarily the right thing to do. However, too much missing data will introduce bias, which will throw off our predictions. This link also says:

“The mice package in R, helps you imputing missing values with plausible data values. These plausible values are drawn from a distribution specifically designed for each missing datapoint.”

This link says that we can impute with the means/medians if the missing values account for no more than 5% of the sample. Peng et al. (2006) suggest that mean imputation is permissible only if no more than 20% of the data is missing. I think we should impute because MICE draws from a distribution, so the prediction is more robust. We can use the MICE package.

3 problems with mean imputation

Unconditional mean imputation results in an artificial reduction in variability because you are imputing values at the center of the variable’s distribution. Also see: 3 problems with mean imputation.

We’ll probably use predictive mean matching (“pmm”) because it is good for quantitative variables. mice circumvents the shrinking of standard errors by using multiple regression models. The variability in the different imputed values gives us a higher, but more correct, standard error. Uncertainty is inherent in imputation, so we account for it through multiple imputed values (Source).

As long as the dataset is large enough, we can use MICE’s PMM. Also:

“Another simulation study that addressed skewed data concluded that predictive mean matching ‘may be the preferred approach provided that less than 50% of the cases have missing data and the missing data are not MNAR’ (Marshall et al. 2010)” (Source)

One of the variables, TEAM_BASERUN_CS, is missing 34% of its values. Let’s create two models: one with this variable kept in and imputed, and one with it taken out. If the p-values are low for both models, we can then use the evaluation data and see which model has the lower RMSE.
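A sketch of what the imputed-variable model could look like with mice (assumes the mice package is installed and the data are loaded; the predictor list in the lm() call is purely illustrative):

```r
library(mice)  # multiple imputation by chained equations

# Predictive mean matching; m = 5 imputed data sets is the mice default
imp <- mice(moneyball_training_data, method = "pmm", m = 5,
            seed = 123, printFlag = FALSE)

# Fit the model on each imputed data set, then pool the estimates
fit <- with(imp, lm(TARGET_WINS ~ TEAM_BATTING_H + TEAM_BASERUN_CS))
summary(pool(fit))
```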

Multiple Imputation in R - Multiple imputation doesn’t like variables that are highly correlated with each other, so we might have to take out some of the highly correlated variables. But removing them means we might be leaving out important information inherent in the data, so instead we might form a linear combination of the correlated variables in order to reduce the correlation.
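For instance, the batting/pitching home-run pair is one of the near-duplicate pairs in the correlation plot. A toy sketch of collapsing such a pair into a single column (averaging is just one plausible combination):

```r
# Toy frame with a near-duplicate pair of columns
df <- data.frame(TEAM_BATTING_HR  = c(100, 150),
                 TEAM_PITCHING_HR = c(102, 148))

# Replace the pair with its average, then drop the originals
df$TEAM_HR <- (df$TEAM_BATTING_HR + df$TEAM_PITCHING_HR) / 2
df$TEAM_BATTING_HR <- df$TEAM_PITCHING_HR <- NULL
```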

I think that we should transform to a normal distribution first using Box-Cox, and then impute.

Linear Model with all Predictors

Top of page 158 in A Modern Approach to Regression describes the following for an example on food prices:

“Assuming that condition (6.6) holds we next look at plots of standardized residuals against each predictor (see Figure 6.2). The random nature of these plots is indicative that model (6.8) is a valid model for the data.”

Therefore, let’s assume that we have a linear model fitting all of the predictors and view the plots of standardized residuals against all of the predictors.

Note that when using all of the variables, observations with any NAs are omitted.

lmod <- lm(TARGET_WINS ~ ., moneyball_training_data)
standard_res <- rstandard(lmod)

# lm() silently drops rows containing NAs, so plot the predictors from
# na.omit() of the data to keep them aligned with the residuals.
for (predictor in colnames(moneyball_training_data)[-1]){
  plot(na.omit(moneyball_training_data)[[predictor]],
       standard_res,
       xlab = predictor,
       ylab = "standardized_residuals")
}

Using Evaluation Data to Determine RMSE

TBD
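When this section is filled in, the computation itself is small; rmse() below is a helper of our own (the evaluation predictions are not available yet, so the call uses made-up numbers):

```r
# Root mean squared error of predictions against actual values
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Toy check with made-up win totals
rmse(c(80, 90, 100), c(82, 88, 103))
```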

VIF of Linear Model with all Predictors

vif(lmod)
##   TEAM_BATTING_H  TEAM_BATTING_2B  TEAM_BATTING_3B  TEAM_BATTING_HR 
##     1.171824e+05     1.685623e+00     1.302198e+00     3.074804e+05 
##  TEAM_BATTING_BB  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS 
##     1.962853e+05     1.941752e+05     1.950069e+00     1.914415e+00 
## TEAM_BATTING_HBP  TEAM_PITCHING_H TEAM_PITCHING_HR TEAM_PITCHING_BB 
##     1.096334e+00     1.160417e+05     3.069624e+05     1.964039e+05 
## TEAM_PITCHING_SO  TEAM_FIELDING_E TEAM_FIELDING_DP 
##     1.946316e+05     1.256819e+00     1.097611e+00

The plot for TEAM_BATTING_2B doesn’t look random; there appears to be an outlier on the left of the plot. It’s hard to tell whether the other plots follow a pattern.

plot(na.omit(moneyball_training_data)$TARGET_WINS, predict(lmod), xlab = 'y', ylab = 'y_hat')
abline(a = 0, b = 1)

sumary(lmod)
##                   Estimate Std. Error t value  Pr(>|t|)
## (Intercept)      61.410051  19.704264  3.1166  0.002139
## TEAM_BATTING_H    1.913476   2.761394  0.6929  0.489267
## TEAM_BATTING_2B   0.026388   0.030290  0.8712  0.384844
## TEAM_BATTING_3B  -0.101176   0.077507 -1.3054  0.193477
## TEAM_BATTING_HR  -4.843707  10.508511 -0.4609  0.645420
## TEAM_BATTING_BB  -4.459691   3.636241 -1.2265  0.221675
## TEAM_BATTING_SO   0.341963   2.598759  0.1316  0.895462
## TEAM_BASERUN_SB   0.033044   0.028673  1.1524  0.250708
## TEAM_BASERUN_CS  -0.011044   0.071431 -0.1546  0.877303
## TEAM_BATTING_HBP  0.082473   0.049600  1.6628  0.098152
## TEAM_PITCHING_H  -1.890957   2.760946 -0.6849  0.494317
## TEAM_PITCHING_HR  4.930432  10.506645  0.4693  0.639462
## TEAM_PITCHING_BB  4.510891   3.633720  1.2414  0.216120
## TEAM_PITCHING_SO -0.373645   2.597052 -0.1439  0.885767
## TEAM_FIELDING_E  -0.172042   0.041404 -4.1552 5.076e-05
## TEAM_FIELDING_DP -0.108192   0.036541 -2.9609  0.003494
## 
## n = 191, p = 16, Residual SE = 8.46704, R-Squared = 0.55
plot(lmod)

The fitted values plot above indicates that \(Y\) and \(\hat{Y}\) might not be linearly related. It looks like the slope should be smaller and the y-intercept higher. We should therefore try a Box-Cox transformation to overcome this nonlinearity.

Box-Cox Transformation

TBD

Try replacing the 0 values with very small values (e.g., 1e-6), which will make the Box-Cox transformation possible (it requires a strictly positive response).
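A minimal sketch with MASS::boxcox() on toy right-skewed data (the formula and data here are placeholders; the same steps would apply to lmod once the response is strictly positive):

```r
library(MASS)  # boxcox() ships with the recommended MASS package

# Toy right-skewed, strictly positive response
set.seed(1)
y <- rexp(100) + 1
x <- rnorm(100)

# Profile likelihood over lambda; pick the maximizer
bc <- boxcox(y ~ x, plotit = FALSE)
lambda <- bc$x[which.max(bc$y)]

# Box-Cox transformed response (log(y) is the lambda = 0 limit)
y_bc <- if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
```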

Backward and Forward Elimination

Let’s try simple backward and forward elimination with stepAIC. (Note that forward selection starting from the full model has no variables left to add, so it simply returns the full model.)

step.model <- stepAIC(lmod, direction = "both", trace = FALSE)
summary(step.model)
## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_HBP + 
##     TEAM_PITCHING_HR + TEAM_PITCHING_BB + TEAM_PITCHING_SO + 
##     TEAM_FIELDING_E + TEAM_FIELDING_DP, data = moneyball_training_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -20.2248  -5.6294  -0.0212   5.0439  21.3065 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      62.01842   19.13272   3.241 0.001413 ** 
## TEAM_BATTING_H    0.02541    0.01009   2.518 0.012648 *  
## TEAM_BATTING_HBP  0.08712    0.04852   1.796 0.074211 .  
## TEAM_PITCHING_HR  0.08945    0.02394   3.736 0.000249 ***
## TEAM_PITCHING_BB  0.05672    0.00940   6.034 8.66e-09 ***
## TEAM_PITCHING_SO -0.03136    0.00728  -4.308 2.68e-05 ***
## TEAM_FIELDING_E  -0.17218    0.03970  -4.338 2.38e-05 ***
## TEAM_FIELDING_DP -0.11904    0.03516  -3.386 0.000869 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.422 on 183 degrees of freedom
##   (1583 observations deleted due to missingness)
## Multiple R-squared:  0.5345, Adjusted R-squared:  0.5167 
## F-statistic: 30.02 on 7 and 183 DF,  p-value: < 2.2e-16
step.model <- stepAIC(lmod, direction = "forward", trace = FALSE)
summary(step.model)
## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + 
##     TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + 
##     TEAM_BASERUN_SB + TEAM_BASERUN_CS + TEAM_BATTING_HBP + TEAM_PITCHING_H + 
##     TEAM_PITCHING_HR + TEAM_PITCHING_BB + TEAM_PITCHING_SO + 
##     TEAM_FIELDING_E + TEAM_FIELDING_DP, data = moneyball_training_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -19.8708  -5.6564  -0.0599   5.2545  22.9274 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      61.41005   19.70426   3.117  0.00214 ** 
## TEAM_BATTING_H    1.91348    2.76139   0.693  0.48927    
## TEAM_BATTING_2B   0.02639    0.03029   0.871  0.38484    
## TEAM_BATTING_3B  -0.10118    0.07751  -1.305  0.19348    
## TEAM_BATTING_HR  -4.84371   10.50851  -0.461  0.64542    
## TEAM_BATTING_BB  -4.45969    3.63624  -1.226  0.22167    
## TEAM_BATTING_SO   0.34196    2.59876   0.132  0.89546    
## TEAM_BASERUN_SB   0.03304    0.02867   1.152  0.25071    
## TEAM_BASERUN_CS  -0.01104    0.07143  -0.155  0.87730    
## TEAM_BATTING_HBP  0.08247    0.04960   1.663  0.09815 .  
## TEAM_PITCHING_H  -1.89096    2.76095  -0.685  0.49432    
## TEAM_PITCHING_HR  4.93043   10.50664   0.469  0.63946    
## TEAM_PITCHING_BB  4.51089    3.63372   1.241  0.21612    
## TEAM_PITCHING_SO -0.37364    2.59705  -0.144  0.88577    
## TEAM_FIELDING_E  -0.17204    0.04140  -4.155 5.08e-05 ***
## TEAM_FIELDING_DP -0.10819    0.03654  -2.961  0.00349 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.467 on 175 degrees of freedom
##   (1583 observations deleted due to missingness)
## Multiple R-squared:  0.5501, Adjusted R-squared:  0.5116 
## F-statistic: 14.27 on 15 and 175 DF,  p-value: < 2.2e-16
step.model <- stepAIC(lmod, direction = "backward", trace = FALSE)
summary(step.model)
## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_HBP + 
##     TEAM_PITCHING_HR + TEAM_PITCHING_BB + TEAM_PITCHING_SO + 
##     TEAM_FIELDING_E + TEAM_FIELDING_DP, data = moneyball_training_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -20.2248  -5.6294  -0.0212   5.0439  21.3065 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      62.01842   19.13272   3.241 0.001413 ** 
## TEAM_BATTING_H    0.02541    0.01009   2.518 0.012648 *  
## TEAM_BATTING_HBP  0.08712    0.04852   1.796 0.074211 .  
## TEAM_PITCHING_HR  0.08945    0.02394   3.736 0.000249 ***
## TEAM_PITCHING_BB  0.05672    0.00940   6.034 8.66e-09 ***
## TEAM_PITCHING_SO -0.03136    0.00728  -4.308 2.68e-05 ***
## TEAM_FIELDING_E  -0.17218    0.03970  -4.338 2.38e-05 ***
## TEAM_FIELDING_DP -0.11904    0.03516  -3.386 0.000869 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.422 on 183 degrees of freedom
##   (1583 observations deleted due to missingness)
## Multiple R-squared:  0.5345, Adjusted R-squared:  0.5167 
## F-statistic: 30.02 on 7 and 183 DF,  p-value: < 2.2e-16